en 

o 

Ph. 
< 






Accelerating Image Reconstruction in Three-Dimensional Optoacoustic Tomography 
on Graphics Processing Units 

Kun Wang*/y Chao Huang*, ^fj Yu-Jiun Kao,^ Cheng-Ying Chou,^ 
Alexander A. Oraevsky,^ and Mark A. Anastasio^l^ 

^''Department of Biomedical Engineering, Washington University in St. Louis, 

St. Lams, MO 63130 

'^^ Department of Bio-Industrial Mechatronics Engineering, 

National Taiwan University, Taipei 106, Taiwan 

■^^ Tomo Wave Laboratories, 675 Bering drive. Suite 575, Houston, Texas, Houston, 

TX 77057 

(Dated: 9 April 2013) 



C/3 

O 



> 

i> 

o 

(N 

^' 

O 
m 



X 



Purpose: Optoacoustic tomography (OAT) is inherently a three-dimensional (3D) 
inverse problem. However, most studies of OAT image reconstruction still employ 
two-dimensional (2D) imaging models. One important reason is because 3D image 
reconstruction is computationally burdensome. The aim of this work is to accelerate 
existing image reconstruction algorithms for 3D OAT by use of parallel programming 
techniques. 

Methods: Parallelization strategies are proposed to accelerate a filtered backpro- 
jection (FBP) algorithm and two different pairs of projection/backprojection op- 
erations that correspond to two different numerical imaging models. The algo- 
rithms are designed to fully exploit the parallel computing power of graphic pro- 
cessing units (CPUs). In order to evaluate the parallelization strategies for the pro- 
jection/backprojection pairs, an iterative image reconstruction algorithm is imple- 
mented. Computer-simulation and experimental studies are conducted to investigate 
the computational efficiency and numerical accuracy of the developed algorithms. 
Results: The GPU implementations improve the computational efficiency by fac- 
tors of 1,000, 125, and 250 for the FBP algorithm and the two pairs of projec- 
tion/backprojection operators, respectively. Accurate images are reconstructed by 
use of the FBP and iterative image reconstruction algorithms from both computer- 
simulated and experimental data. 

Conclusions: Parallelization strategies for 3D OAT image reconstruction are pro- 
posed for the first time. These GPU-based implementations significantly reduce the 
computational time for 3D image reconstruction, complementing our earlier work on 
3D OAT iterative image reconstruction. 

Keywords: Optoacoustic tomography, photoacoustic tomography, thermoacoustic 
tomography, graphics processing unit (GPU), compute unified device architecture 
(CUDA) 
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I. INTRODUCTION 

Optoacoustic tomography (OAT), also known as photoacoustic computed tomography, 
is an emerging imaging modahty that has great potential for a wide range of biomedical 
imaging applications^"-. In OAT, a short laser pulse is employed to irradiate biological 
tissues. When the biological tissues absorb the optical energy, acoustic wave fields can be 
generated via the thermoacoustic effect. The acoustic wave fields propagate outward in 
three-dimensional (3D) space and are measured by use of ultrasonic transducers that are 
distributed outside the object. The goal of OAT is to obtain an estimate of the absorbed 
energy density map within the object from the measured acoustic signals. To accomplish 
this, an image reconstruction algorithm is required. 

A variety of analytic image reconstruction algorithms have been proposed^"-. These al- 
gorithms generally assume an idealized transducer model and an acoustically homogeneous 
medium. Also, since they are based on discretization of continuous reconstruction formulae, 
these algorithms require the acoustic pressure to be densely sampled over a surface that 
encloses the object to obtain an accurate reconstruction. To overcome these limitations, 
iterative image reconstruction algorithms have been proposed^"—. Although the optoacous- 
tic wave intrinsically propagates in 3D space, when applying to experimental data, most 
studies have employed two-dimensional (2D) imaging models by making certain assump- 
tions on the transducer responses and/or the object structures^'i^'i^'i^ii^i^^. An important 
reason is because the computation required for 3D OAT image reconstruction is excessively 
burdensome. Therefore, acceleration of 3D image reconstruction will facilitate algorithm 
development and many applications including real-time 3D PACT— i^. 

A graphics processing unit (GPU) card is a specialized device specifically designed for 
parallel computations^^. Compute unified device architecture (CUDA) is an extension of the 
C/Fortran language that provides a convenient programming platform to exploit the parallel 
computational power of GPUs^. The CUDA-based parallel programming technique has 
been successfully applied to accelerate image reconstruction in mature imaging modalities 
such as X-ray computed tomography (CT)^""— and magnetic resonance imaging (MRI)^. In 
OAT, however, only a few works on utilization of CPUs to accelerate image reconstruction 
have been reported^^i^. For example, the k-wave toolbox employs the NVIDIA CUDA Fast 
Fourier Transform library (cuFFT) to accelerate the computation of 3D FFT— . Also a 



GPU-based sparse matrix-vector multiplication strategy has been applied to 3D OAT image 
reconstruction for the case that the system matrix is sparse and can be stored in memory^S. 
However, there remains an important need to develop efficient implementations of OAT 
reconstruction algorithms for general applications in which the system matrix is too large 
to be stored. 

In this work, we propose parallelization strategies, for use with CPUs, to accelerate 3D 
image reconstruction in OAT. Both ffitered backprojection (FBP) and iterative image recon- 
struction algorithms are investigated. For use with iterative image reconstruction algorithms, 
we focus on the parallelization of projection and backprojection operators. Specifically, we 
develop two pairs of projection/backprojection operators that correspond to two distinct 
discrete-to-discrete (D-D) imaging models employed in OAT, namely the interpolation-based 
and the spherical-voxel-based D-D imaging models. Note that our implementations of the 
backprojection operators compute the exact adjoint of the forward operators, and therefore 
the projector pairs are 'matched—. 

The remainder of the article is organized as follows. In Section [Til we briefly review OAT 
imaging models in their continuous and discrete forms. We propose GPU-based paralleliza- 
tion strategies in Section lllli Numerical studies and results are described in Section IIVI and 
Section |V] respectively. Finally, a brief discussion and summary of the proposed algorithms 
are provided in Section |VT1 

II. BACKGROUND 

A. Continuous-to-continuous imaging models and analytic image 
reconstruction algorithms 

A continuous-to-continuous (C-C) OAT imaging model neglects sampling effects and 
provides a mapping from the absorbed energy density function A{r) to the induced acoustic 
pressure function p{r^,t). Here, t is the temporal coordinate, r &V and r* G S* denote the 
locations within the object support V and on the measurement surface S, respectively. A 
canonical OAT C-C imaging model can be expressed asi'i^'^^: 



where S{t) is the Dirac delta function, /3, cq, and Cp denote the thermal coefficient of vol- 
ume expansion, (constant) speed-of-sound, and the specific heat capacity of the medium at 
constant pressure, respectively. We introduce an operator notation Ticc to denote this C-C 
mapping. 

Alternatively, Eqn. ([1]) can be reformulated as the well-known spherical Radon transform 
(SRT)i^^: 

^(r^t)= [drA{r)S{cot-\r^-r\), (2) 

Jv 

where the function g{r^,t) is related to p{r^,t) as 

The SRT model provides an intuitive interpretation of each value of (7(r'^,t) as a surface 
integral of A{r) over a sphere centered at r'^ with radius tcQ. 

Based on C-C imaging models, a variety of analytic image reconstruction algorithms have 
been developed^"-. For the case of a spherical measurement geometry, an FBP algorithm in 
its continuous form is given by^: 
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A(r) = rp dr' "^^ ' +- ': ' , ,, (4) 



|r — r^^l Co dt 



where R^ denotes the radius of the measurement surface 5". 

B. Discrete-to-discrete (D-D) imaging models and iterative image 
reconstruction algorithms 

When sampling effects are considered, an OAT system is properly described as a 
continuous-to-discrete (C-D) imaging model^'^'^'^'^: 

H,K^, = h'(i)*t^ fdr^pir^,t) £o°t::.:ri, (5) 

where Q and K denote the total numbers of transducers (indexed by q) and the time 
samples (indexed by k) respectively. Sg is the surface area of the g-th transducer, which is 
assumed to be a subset of S; h^{t) denotes the acousto-electric impulse response (EIR) of 
each transducer that, without loss of generality, is assumed to be identical for all transducers; 
'*t' denotes a linear convolution with respect to time coordinate; and Aj is the temporal 



sampling interval. The vector u represents the lexicographically ordered measured voltage 
signals whose {qK + A;)-th element is denoted by [uj^^+fc. 

In order to apply iterative image reconstruction algorithms a D-D imaging model is 
required, which necessitates the discretization of A{y). The following A^- dimensional repre- 
sentation of the object function can be employed^i^^: 

7V-1 

A(r)^5^[a]„^„(r), (6) 

n=0 

where o; is a coefficient vector whose n-th element is denoted by [q:].„ and ipni^^) is the 
expansion function. On substitution from Eqn. (|6]) into Eqn. ([5]), where p{r^,t) is defined 
by Eqn. ([1]), one obtains a D-D mapping from a to u, expressed as 

u ^ Ha, (7) 

where each element of the matrix H is defined as 

Here, H is the D-D imaging operator also known as system matrix or projection operator. 
Note that the '~' in Eqn. (JTj) is due to the use of the finite-dimensional representation of 
the object function (i.e., Eqn. ([6])). No additional approximations have been introduced. 

Below we describe two types of D-D imaging models that have been employed in 
OAT— '2^'2E: the interpolation-based imaging model and the spherical-voxel-based imag- 
ing model. The quantities u, H, and a (or ipn) in the two models will be distinguished by 
the subscripts (or superscripts) 'int' and 'sph', respectively. 

1. Interpolation- based D-D imaging model 

The interpolation-based D-D imaging model defines the coefficient vector as samples of 
the object function on the nodes of a uniform Cartesian grid: 

[aint]„= /rfr(5(r-r„)A(r), n = 0, 1, ■ ■ ■ , iV - 1, (9) 

Jv 

where, r„ = (x„, i/n, Zn)'^ specifies the location of the n-th node of the uniform Cartesian grid. 
The definition of the expansion function depends on the choice of interpolation method^. If 
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a trilinear interpolation method is employed, the expansion function can be expressed as^: 

{(I - l^"^"l vi _ \y-y^\ ](i _ l^-^"h if It - T I I?/ - ?/ I b - 2; I < A 
0, otherwise 

where A^ is the distance between two neighboring grid points. 

In principle, the interpolation-based D-D imaging model can be constructed by substitu- 
tion from Eqns. iQ and ( TTOj) to Eqn. ([8]). In practice, however, implementation of the surface 
integral over Sg is difficult for the choice of expansion functions in Eqn. flTOj) . Also, imple- 
mentations of the temporal convolution and "Hcc"^"* usually require extra discretization 
procedures. Therefore, utilization of the interpolation-based D-D model commonly assumes 
the transducers to be point-like. In this case, the implementation of Hint is decomposed as 
a three-step operation: 

Uint = Hintaint = H^'DGaint, (11) 

where G, D, and H'^ are discrete approximations of the SRT (Eqn. ([2])), the differential 
operator (Eqn. ([3])), and the operator that implements a temporal convolution with EIR, 
respectively. We implemented G in a way^"^"^ that is similar to the 'ray-driven' imple- 
mentation of Radon transform in X-ray CT— , i.e, for each data sample, we accumulated 
the contributions from the voxels that resided on the spherical shell specified by the data 
sample. By use of Eqns. ([2]), (l6j), ([9]), and ( fTOJ) . one obtains: 

N-l Ni-lNj-1 

n=0 j=0 i=0 

where [g]q/<+fe ~ g{r^g,t)\t=kAt with r^ specifying the location of the g-th point-like trans- 
ducer, and Ni and Nj denote the numbers of divisions over the two angular coordinates of a 
local spherical coordinate system shown in Fig. [l]-(b). A derivation of Eqn. (TT2l) is provided 
in Appendix. The differential operator in Eqn. ([3]) is approximated as 

rx-k 1 _ (^ f[s]gK+k+l [g\qK+k~l\ _ r ] /,o\ 

where [Pint]gA'+fc ~ p{^l,t)\t=kAf Finally, the continuous temporal convolution is approxi- 
mated by a discrete linear convolution as^ 

K-l 

[H^Pint] g^_^^ = 2_^[^lk-l^N.[Pmt]qK+K = [Uint]<?X+fc, (14) 



IqK+k 

K=0 



where [h% = Ath''{t)\t=kAt 



2. Spherical-voxel-based D-D imaging model 

The spherical-voxel-based imaging model is also widely employed in 0AT-ii^i2Si^i^. It 
employs the expansion functions 

f 1, if |r-r„| < AJ2 
I 0, otherwise 

where r^ is defined as in Eqn. Qj. The n-th expansion function ^^^'^(r) is a uniform sphere 
that is inscribed by the n-th cuboid of a Cartesian grid. The n-th component of the coefficient 
vector asph is defined as: 

["sph]„ = ^/rfrCP\r)A(r), (16) 

where Kubc and V^ph are the volumes of a cubic voxel of dimension A^ and of a spherical 
voxel of radius As/2 respectively. 

Unlike the interpolation-based imaging model, by use of the expansion functions defined 
in Eqn. (TT5|) . the surface integral over Sg and 'Hcc'^n^^ in Eqn. ([8]) can be converted to a 
temporal convolution and calculated analyticallyi^i^. To avoid utilizing excessively high 
sampling rate to mitigate aliasing, the spherical-voxel-based imaging model can be conve- 
niently implemented in the temporal frequency domain as^^: 

^-1 1 . 

[^«ph]gL+i=Po(/)Xl ["«ph]n^^g(''«'/) ,_,^ ' for/ = 0,l,--- ,L-1, (17) 

n=0 ^'1 * f 

where Aj is the frequency sampling interval, and L denotes the total number of temporal- 
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The 



frequency samples indexed by /. A derivation of Eqn. (1171) can be found in Ref. 
function /iq(r„, /) represents the temporal Fourier transform of the spatial impulse response 
(SIR) of the g-th transducer for the source located at r„, expressed as: 

r exp(-i27r/i^:^^^) 



A^ /Tt/A,. 1 . /7r/A,. 

cos sm 

2co ^ Co > 2nf ^ Co ^ 



hV), (19) 



Also, Poif) is defined as 

PoU) = -3-^ 

where /i^(/) is the EIR in temporal-frequency domain. In summary, the imaging model can 
be expressed in matrix form as: 

Usph = HsphCKsph. (20) 
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3. Adjoints of the system matrices 

Iterative image reconstruction algorithms employ numerical implementations of the pro- 
jection operator, i.e., the system matrix H, as well as its adjoint, denoted by H^ — . The 
adjoint is also referred to as the backprojection operator. Note that for most practical ap- 
plications, H and H^ are too large to be stored in the random access memory of currently 
available computers. Therefore, in practice, the actions of H and H'^ are almost always 
calculated on the fly. The same strategy was adopted in this work. 

According to the definition of the adjoint operator—*^, Hj^^^. = G^D'I'H''' , where 

K-l 



K=0 



D^p;. 



(3 



,K^k SnC.Alk 



\[Pmt\gK+k-l I-Pintj 



qK+k+1 



b'] 



qK+k' 



and 



to-' 



GTg 



Q-lK-l Ni-lNj-1 

g=0 fc=0 i=0 j=0 

It can also be verified that the adjoint operator H^ ■ is given by: 

Q-lL-l 

= 5Z 5Z [usph]^^^^pS(/)^^(rn, /) 



['-*^intJn- 



(21) 
(22) 

(23) 



■Hgph^sph 



/=«A/ 



(24) 



where the superscript '*' denotes the complex conjugate. Unlike the unmatched backpro- 
jection opertors^ that are obtained by discretization of the continuous adjoint operator, 
utilization of the exact adjoint operator facilitates the convergence of iterative image recon- 
struction algorithms. 

C. GPU architecture and CUDA programming 



CUDA programming are briefly 



28 and 



29 



for additional details. 



The key features of GPU architecture and the basics of 
summarized in this section. We refer the readers to Refs. 

A GPU card contains multiple streaming multiprocessors. Each streaming multiprocessor 
is configured with multiple processor cores. For example, the Tesla C1060 possesses 30 
streaming multiprocessors with 8 processor cores on each; and the Tesla C2050 possesses 14 
streaming multiprocessors with 32 processor cores on eacb^S. The processor cores in each 
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multiprocessor execute the same instruction on different pieces of data, which is referred 
to as "single instruction, multiple data" (SIMD) model of parallel programming. In order 
to fully exploit the computing power of GPUs, one of the major challenges is to design 
a parallelization strategy fitting in the SIMD framework such that the largest number of 
processor cores can execute the computation simultaneously^. 

A GPU card has six types of memory that have varying capacities and different access 
rules and efficiencies: (1) Registers are assigned for each thread and have the fastest access. 
(2) Shared memory is assigned for each block and can be efficiently accessed by all threads in 
the block if designed appropriately. (3) Constant memory is read-only and can be accessed by 
all threads efficiently. (4) Texture memory is also read-only and is optimized for interpolation 
operations. (5) Global memory has the slowest access that takes hundreds times more clock 
cycles than does the computation of basic arithmetic operations. (6) Local memory is 
assigned for each thread but has a slow access as does the global memory. Therefore, an 
efficient GPU-based implementation in general requires a limited number of global and local 
memory access. 

CUDA is a platform and programming model developed by NVIDIA that includes a 
collection of functions and keywords to exploit the parallel computing power of GPUs^^. A 
CUDA parallel program is composed of a host program and kernels. The host program is 
executed by CPUs and launches the kernels, which are custom-designed functions executed 
by GPUs. A general parallel programming strategy is to launch multiple instances of a 
kernel and to run the multiple instances concurrently on GPUs. In CUDA, each instance 
of the kernel is named as a thread and processes only a portion of the data. A hierarchy of 
threads is employed: Threads are grouped into blocks, and blocks are grouped into a grid. 
Therefore, each thread is specified by a multi-index containing a block index and a thread 
index within the block. 



III. GPU-ACCELERATED RECONSTRUCTION ALGORITHMS 

In this section, we propose GPU-based parallelization strategies for the FBP algorithm 
and the projection/backprojection operations corresponding to the interpolation-based and 
the spherical-voxel-based D-D imaging models. 
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A. Measurement geometry 

We employed a spherical measurement geometry shown in Fig. [T]-(a). The measurement 
sphere was of radius R^ centered at the origin of the Cartesian coordinate system (or the 
equivalent spherical coordinate system). The polar angle 6'^ G [0, tt] was equally divided with 
interval Aga = n/Nr, starting from 6*^;^. At each polar angle, a ring on the sphere that was 
parallel to the plane z = can be specified, resulting Nr rings. On each ring, N^ ultrasonic 
transducers were assumed to be uniformly distributed with azimuth angle interval A^s = 
2iT/Ny. Hereafter, each azimuth angle will be referred to as a tomographic view. At each 
view, we assumed that Nt temporal samples were acquired and the first sample corresponded 
to time instance tmm- For implementations in temporal-frequency domain, we assumed that 
Nf temporal-frequency samples were available and the first sample corresponded to /min- 
The region to be reconstructed was a rectangular cuboid whose edges were parallel to the axes 
of the coordinate system and the left-bottom-back vertex was located at {x^imymm^ Zmm)- 
The numbers of voxels along the three coordinates will be denoted by N^, Ny, and Nz, 
respectively, totally N = N^NyN^ voxels. We also assumed the cuboid was contained in 
another sphere of radius R that was concentric with the measurement sphere shown in Fig. 
[B(b). 

B. Implementation of the FBP algorithm 

Central processing unit (CPU)-based implementations of continuous FBP formulae have 



been described in Refs. 



Bl 



Though the discretization methods vary, in general, three 
approximations have to be employed. Firstly, the first-order derivative term 5p(r*, t)/dt has 
to be approximated by a difference scheme up to certain order—. Secondly, the measurement 
sphere has to be divided into small patches, and the surface integral has to be approximated 
by a summation of the area of every patch weighted by the effective value of the integrand 
on the patch. Finally, the value of the integrand at an arbitrary time instance t = |r'^ — r|/co 
has to be approximated by certain interpolation method. 

In this study, we approximated the surface integral by use of the trapezoidal rule. As 
described earlier, the spherical surface was divided into NrN^ patches. For the transducer 
indexed by q that was located at r^ = {R^, 9^, 0^), the area of the patch was approximated 
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by {R'^y Ags A^a sin^* The value at time instance t = jr^ — r„|/co was approximated by the 
hnear interpolation from its two neighboring samples as: 

pK t) . .,-.., ^{k + l-~k) [p] ,^^, + {k-k) [p] „^,^^, (25) 



qK+k ' V ■•'JlfjqK+k+V 

=0 

where k = (Ir^ — r.„|/co — tmin)/Aj, and k is the integer part of k. Here p is a vector of 
lexicographically ordered samples of the pressure function p{r^,t), which is estimated from 
the measured voltage data vector u. Also, the first-order derivative term was approximated 
by: 



d 



1 



PK t) ^^ Kzl:^ ^ ^ ( H ,K+k+l - [P] gK+k) ■ (26) 



By use of these three numerical approximations, the discretized FBP formula was expressed 

as: 

Unlike the implementations of FBP formulas in X-ray cone beam CT— i^, we combined the 
filter and the linear interpolation. This reduced the number of visits to the global memory 
in the GPU implementation described below. 

We implemented the FBP formula in a way that is similar to the 'pixel-driven' imple- 
mentation in X-ray CT— , i.e., we assigned each thread to execute the two accumulative 
summations in Eqn. (1271) for each voxel. We bound the pressure data p to texture memory 
because it is cached and has a faster accessing rate. Therefore our implementation only 
requires access to texture memory twice and to global memory once. The pseudo-codes are 
provided in Algs. [1] and |2] for the host part and the device part respectively. Note that 
the pseudo-codes do not intend to be always optimal because the performance of the codes 
could depend on the dimensions of p and afbp. For example, we set the block size to be 
(A^2, 1,1) because for our applications, A^^ was bigger than A^^ and Ny and smaller than the 
limit number of threads that a block can support (i.e., 1024 for the NVIDIA Tesla C2050). 
If the values of A^^;, A^y, and A^^ change, we may need to redesign the dimensions of the grid 
and blocks. However, the general SIMD parallelization strategy remains. 
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C. Implementation of Hint and H; 
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The forward projection operation HintCKint is composed of three consecutive operations 
g = GcKint, Pint = Dg, and Uint = H^pint that are defined in Eqns. (IT^ . (IT^ . and (IT^ . 
respectively. Both the difference operator D and the one- dimensional (ID) convolution H'^ 
have low computational complexities while the SRT operator G is computationally burden- 
some. Hence, we developed the GPU-based implementation of G while leaving D and H'^ 
to be implemented by CPUs. 

The SRT in OAT shares many features with the Radon transform in X-ray CT. Thus, our 
GPU-based implementation is closely related to the implementations of Radon transform 
that have been optimized for X-ray CT— "— . The surface integral was approximated ac- 
cording to the trapezoidal rule. Firstly, the integral surface was divided into small patches, 
which is described in the Appendix. Secondly, each patch was assigned an effective value of 
the object function by trilinear interpolation. The trilinear interpolation was calculated by 
use of the texture memory of CPUs that is specifically designed for interpolation. Finally, 
GPU threads accumulated the areas of patches weighted by the effective values of the object 
function and wrote the final results to global memory. The pseudo-codes for implementation 
of G are provided in Algs. [3] and H] for the host part and the device part, respectively. Note 
that we employed the "one- level" -strategy^, i.e., each thread calculates one data sample. 
Higher level strategies have been proposed to improve the performance by assigning each 
block to calculate multiple data samples^, which, however, caused many thread idles in 
OAT mainly because the amount of computation required to calculate a data sample varies 
largely among samples for SRT. 

Implementation of the backprojection operator Hj^^ was very similar to the implementa- 
tion of Hint- The operators D^ and H*^^ were calculated on CPUs while G^ was calculated 
by use of CPUs. The pseudo-codes are provided in Algs. [5] and [61 We made use of the 
CUDA function 'atomicAdd' to add weights to global memory from each thread. 

D. Implementation of Hgph and H^ . 

Implementation of the forward projection operation for the spherical- voxel-based imaging 
model is distinct from that of the interpolation-based model. The major difference is that 
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calculation of each element of the data vector for the spherical-voxel-based imaging model 
requires the accumulation of the contributions from all voxels because the model is expressed 
in the temporal frequency domain. Because of this, the amount of computation required to 
calculate each data sample in the spherical-voxel-based imaging model is almost identical, 
simplifying the parallelization strategy. 

We proposed a parallelization strategy that was inspired by one applied in advanced MRI 
reconstruction^^ and is summarized as follows. Discrete samples of Po{f) defined in Eqn. 
( TTOl) were precalcualted and stored as a vector po in constant memory. Because the size 
of the input vector Cisph is often too large to fit in the constant memory, we divided oisph 
into sub-vectors that matched the capacity of the constant memory. We employed a CPU 
loop to copy every sub- vector sequentially to the constant memory and call the GPU kernel 
function to accumulate a partial summation. The major advantage of this design is that the 
total number of global memory visits to calculate one data sample is reduced to the number 
of sub-vectors. 

Implementation of the projection operator for the spherical-voxel-based imaging model 
generally involves more arithmetic operations than does the interpolation-based imaging 
model. Moreover, the spherical-voxel-based imaging model has been employed to compen- 
sate for the finite aperture size effect of transducers^i^^, which makes the computation even 
more burdensome. Because of this, we further developed an implementation that employed 
multiple GPUs. The pseudo-codes of the projection operation are provided in Algs. [71 El 
andm We created A'pth pthreads on CPUs by use of the 'pthread.h' library. Here, we denote 
the threads on CPUs by 'pthread' to distinguish from threads on GPUs. We divided the 
input vector agph into A^pth sub- vectors (denoted by apth's) of equal size and declared an out- 
put vector u' j^ of dimension NpthNrNyNf. By calling the pthread function 'fwd_pthread', 
Nsph pthreads simultaneously calculated the projection. Each pthread projected an ctpth to 
a partial voltage data vector uL^ that filled in the larger vector u' h- Once all pthreads 
finished filling their u'^j^ into u' j^, the projection data Ugph were obtained by a summation 
of the iVpth Upth's. 

Implementation of the backprojection operator was similar except the dividing and loop- 
ing were over the vector Ugph instead of ctsp^. The pseudo-codes for the backprojection 
operation are provided in Algs. [TOl [HI and [121 
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IV. DESCRIPTIONS OF COMPUTER SIMULATION AND 
EXPERIMENTAL STUDIES 

The computational efficiency and accuracy of the proposed GPU-based implementations 
of the FBP algorithm and projection/backprojection operators for use with iterative image 
reconstruction algorithms were quantified in computer simulation and experimental OAT 
imaging studies. 

A. Computer-simulation studies 

Numerical phantom: The numerical phantom consisted of 9 uniform spheres that were 
blurred by a 3D Gaussian kernel possessing a full width at half maximum (FWHM) of 
0.77-mm. The phantom was contained within a cuboid of size 29.4 x 29.4 x 61.6-mm^. A 
2D image corresponding to the plane y = through the phantom is shown in Fig. |2]-(a). 

Smulated projection data: The measurement surface was a sphere of radius R^ = 65-mm. 
corresponding to an exsiting OAT imaging systen>2^i^. As described in Section llllt ideal 
point-like transducers were uniformly distributed over 128 rings and 90 tomographic views. 
The 128 rings covered the full it polar angle, i.e., ^^j^ = 7r/256, while the 90 views covered 
the full 27r azimuth angle. The speed of sound was set at Cq = 1.54-mm//is. We selected the 
Griineisen coefficient as F = f3cl/Cp = 2, 000 of arbitrary units (a.u.). For each transducer, 
we analytically calculated 1022 temporal samples of the pressure function at the sampling 
rate of /sam = 20-MHz by use of Eqn. ([1]). Because we employed a smooth object function, 
the pressure data were calculated by the following two steps: Firstly, we calculated temporal 
samples of pressure function Pus(r'^,t) that corresponds to the 9 uniform spheres byi^^^ 

i=o I 0, otherwise 

where r,, R^ and Ai denote the center location, the radius and the absorbed energy density 
of the 2-th sphere, respectively. Subsequently, we convolved Pusli"**? t) with a one-dimensional 
(ID) Gaussian kernel with FWHM = 0.5-/is^ to produce the pressure data. From the sim- 
ulated pressure data, we calculated the temporal-frequency spectrum by use of fast Fourier 
transform (FFT), from which we created an alternative data vector that contained 511 fre- 
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quency components occupying (0, 5]-MHz for each transducer. The simulated projection 
data in either the time domain or the temporal frequency domain will hereafter be referred 
to as "128 X 90"-data. By undersampling the "128 x 90"-data uniformly over rings and 
tomographic views, we created three subsets that contained varying number of transducers. 
These data sets will be referred to as "64 x 90" -data, "64 x 45" -data, and "32 x 45" -data, 
where the two numbers specify the number of rings and the number of tomographic views, 
respectively. 

Reconstruction algorithms: The GPU accelerated FBP algorithm was employed to recon- 
struct the object function sampled on a 3D Cartesian grid with spacing A^ = 0.14-mm. The 
dimension of the reconstructed images CKfbp was 210 x 210 x 440. 

We employed an iterative image reconstruction algorithm that sought to minimize a pe- 
nalized least-squares (PLS) objective^i^. Two versions of the reconstruction algorithm were 
developed that utilized the interpolation-based imaging model and the spherical- voxel-based 
imaging model respectively. The two versions sought to solve the optimization problems by 
use of the linear conjugate gradient (CG) method^i^: 

Aint = argmin ||u - Hintaint||^ + nR{cXint), (29) 

and 

dsph = argmin ||u - HsphCKsphU^ + /i-R(Q;sph), (30) 

Ctsph 

respectively, where R{cy.) is a regularizing penalty term whose impact is controlled by the 
regularization parameter /i. The penalty term was employed only when processing the 
experimental data as described in Section IIVJ -B. The reconstruction algorithms required 
computation of one projection and one backprojection operation at each iteration. Hereafter, 
the two reconstruction algorithms will be referred to as PLS-Int and PLS-Sph algorithms, 
respectively. We set A^ = 0.14-mm. Therefore, both the dimensions of CKint and a.sph were 
210 X 210 X 440. 

Performance assessment: We compared the computational times of 3D image reconstruction 
corresponding to the GPU- and CPU-based implementations. The CPU-based implementa- 
tions of the PLS-Int and PLS-Sph algorithms take several days to complete a single iteration 
even for the "32 x 45" -data. Therefore, we only recorded the computational time for the 
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CPU-based implementations to complete a single iteration when the data vector contained 
a single transducer. We assumed that the computational times were linearly proportional 
to the number of transducers in the data sets because the CPU-based implementations are 
sequential. 

The CPU-based implementations employed the single-precision floating-point format 
rather than the conventional double-precision utilized by CPU-based implementations. In 
order to quantify how the single-precision floating-point format would degrade the image ac- 
curacy, we calculated the root mean square error (RMSE) between the reconstructed image 
and the phantom defined by: 



RMSE = -\/ i^(a - a)T(a - a), (31) 

where ct and a are the samples of the phantom and the coefficients of the reconstructed 
images respectively. 

Hardware specifications: All implementations were tested on the platform consisted of dual 
quad-core Intel(R) Xeon (R) CPUs with a clock speed 2.40-CHz. The CPU-based imple- 
mentations of the FBP and the PLS-Int algorithms were tested on a single Tesla C2050 
GPU, while the PLS-Sph algorithm was tested on 8 Tesla C1060 CPUs. 

B. Experimental studies 

The FBP, PLS-Int and PLS-Sph algorithms were investigated by use of an existing data 
set corresponding to a live mouse^"^. The scanning geometry and dimensions were the 
same as those employed in the computer-simulation studies except that only 64 rings were 
uniformly distributed over the polar angle ranging from 14° to 83°. The transducers were of 
size 2 X 2-mm^. The raw data were acquired at 180 tomographic views, which are referred 
to as "full data". We undersampled the "full data" uniformly over the tomographic views, 
constructing a subset containing 45 tomographic views. The subset will be referred to as 
"quarter data" . 

Unlike in the idealized computer-simulation studies, the transducer response has to be 
compensated for when processing the experimental data. When implementing the FBP 
algorithm, the EIR was compensated for by a direct Fourier deconvolution, expressed in 
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temporal frequency domain as^: 

Pir\ f) = ^^W{f), (32) 

where W{f) is a window function for noise suppression. In this study, we adopted the Hann 
window function defined as: 

1 



Wif) = - l-cos(7r^^V^) , (33) 



fc 

where the cutoff frequency was chosen as fc = 5-MHz. When applying iterative image 
reconstruction algorithms, the transducer effects were implicitly compensated for during 
iteration by employing imaging models that incorporates the transducer charactertics^"^. 
We incorporated the EIR into the interpolation-based imaging model while incorporating 
both the EIR and the SIR into the spherical-voxel-based imaging model. 

For both PLS-Int and PLS-Sph algorithms, we employed a quadratic smoothness penalty 
to mitigate measurement noise^: 

N-l 

i?(a) = Y, {Hn - [C^UY + (Mn - Hnf + (M™ " Hnf , (34) 

n=0 

where Ux, Uy and n^ were the indices of the neighboring voxels before the n-th voxel along 
the three Cartesian axes, respectively. 

V. RESULTS 

A. Computational efRciency 

As shown in Tabled the GPU-based implementations took less than 0.1%, 0.8% and 0.4% 
of the computational times required by corresponding CPU-based implementations for the 
FBP, the PLS-Int, and the PLS-Sph algorithms, respectively. The relative computational 
times for the GPU-based implementations are nearly linearly proportional to the amount of 
data. Note that the "64x90"-data and the "quarter data" are of the same size. However, the 
computational times of the "quarter data" are more than L8 times those of the "64 x 90"- 
data. This is because the calculation of the SIR increases the computational complexity of 
the reconstruction algorithm. 
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B. Computational accuracy 

Images reconstructed by use of the CPU- and CPU-based implementations of the FBP 
algorithm are almost identical. From the "128 x 90" -data, in which case, transducers were 
densely distributed over the measurement surface, both implementations reconstructed ac- 
curate images, as shown Fig. [2]-(b) and -(c). The profiles along the three arrows in Fig. [2] 
are plotted in Fig. |5]-(a), suggesting a nearly exact reconstruction. As expected, when the 
amount of measurement data are reduced, the reconstructed images contain more artifacts 
as shown in Fig. [31 However, the images reconstructed by use of CPU- and CPU-based im- 
plementations remain indistinguishable. The plots of the RMSE versus the amount of mea- 
surement data employed in Fig. [6] overlap, also suggesting the single-precision floating-point 
format employed by the CPU-based implementation has little impact on the computational 
accuracy. 

The CPU-based implementations of the PLS-Int and PLS-Sph algorithms both recon- 
structed accurate images as displayed in Fig. HI As expected, the images reconstructed by 
use of both iterative algorithms contain fewer artifacts than do those reconstructed by use of 
the FBP algorithm from the same amount of data. Unlike the images reconstructed by use 
of the FBP algorithm from the "64 x 90" -data (Fig. [3]- (a) or -(d)), the images reconstructed 
by use of both iterative algorithms (Fig. |4]-(a) and -(d)) appear to be identical to the nu- 
merical phantom. The profiles along the two arrows in Fig. Il]-(a) and -(d) are plotted in 
Fig. |5]-(b), further confirming the computational accuracy of iterative image reconstruction 
algorithms. The plots of the RMSE versus the amount of measurement data employed in 
Fig. suggest the iterative image reconstruction algorithms in general outperform the FBP 
algorithm from the same amount of data. 

C. Experimental results 

The maximum intensity projection (MIP) of the 3D mouse images reconstructed by use 
of the CPU-based implementations reveal the mouse body vasculature as shown in Fig. 
[71 Images reconstructed by use of both the PLS-Int and the PLS-Sph algorithms appear to 
have cleaner background than do the images reconstructed by use of the FBP algorithm from 
the same amount of data. All images reconstructed by iterative algorithms were obtained 
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by 20-iterations starting with uniform zeros as the initial guess. The PLS-Int algorithm 
took approximately a half day and 2 days to process the "quarter data" and the "full 
data" respectively. The PLS-Sph algorithm took approximately one day and 4 days to 
process the "quarter data" and the "full data" respectively. Alternatively, if the CPU-based 
implementations were utilized, the PLS-Int algorithm would take an estimated 68 days and 
277 days to process the "quarter data" and the "full data" respectively. The PLS-Sph 
algorithm would take an estimated 275 days and 1, 100 days to process the "quarter data" 
and the "full data" respectively. 

VI. DISCUSSION AND CONCLUSION 

In this study, we developed and investigated GPU-based implementations of the FBP 
algorithm and two pairs of projection/backprojection operators for 3D OAT. Our imple- 
mentation of the FBP algorithm improved the computational efficiency over 1,000 times 
compared to the CPU-based implementation. This work complements our earlier studies 
that demonstrated the feasibility of 3D iterative image reconstruction in practice^ii^. 

Our current implementations of the iterative image reconstruction algorithms still re- 
quire several days to process the densely sampled data set, which, however, can be further 
improved. Firstly, the amount of measurement data required for accurate image reconstruc- 
tion can be further reduced by developing advanced image reconstruction methods^*^'^'^. 
Secondly, the number of iterations required can be reduced by developing fast-converging 
optimization algorithms^^i^i^. 

The proposed parallelization strategies by use of CPUs are of general interest. The 
implementation of the FBP algorithnt^ can be adapted to other analytic image reconstruc- 



57-59. We demonstrated the fea- 



tion algorithms, including those described in Refs. |5|, [Zl, 
sibility of PLS algorithm that utilized the proposed GPU-based implementations of the 
projection/backprojection operators. By use of these implementations, many advanced im- 
age reconstruction algorithms may also be feasible in practice^^. Though we described 
our parallelization strategies for the projection/backprojection operators that utilized two 
discrete-to-discrete imaging models, these strategies can also be applied to other D-D imag- 
ing models^iii'^'^. Therefore, the proposed algorithms will facilitate the further investiga- 
tion and application of advanced image reconstruction algorithms in 3D OAT. 
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APPENDIX 

Derivation of Equation flT2l) 

The integrated data function g{r'^,t) in Eqn. (^, evaluated at the g-th transducer and 
the k-th time instance, can be expressed as: 



g{rl,t) = / drA{r), (35) 

t=kAt J \r^^-r\=kcoAt 

where r* denotes the location of the g-th point-like transducer. We defined a local coordinate 
system, distinguished by a superscript 'tr', centered at the g-th transducer with the 2;*''-axis 
pointing to the origin of the global coordinate system as shown in Fig. [I]-(b). Assuming the 
object function A{r) is compactly supported in a sphere of radius R, the integral surface is 
symmetric about the z^'^-a.xis. Thus, the orientations of the x^'^- and y^^-axes can be arbitrary 
within the z^"^ = plane. Representing the right-hand side of Eqn. fl35|) in the local spherical 
coordinate system, one obtains 

g{rl,t) =(A;coAOM dd'^ sinB'' rf</>*M(A;coAi, r, (/>*=•), (36) 

<=fcAt Jo Jo 

where 0^^^ is half of the apex angle of the cone that corresponds to the intersectional 
spherical cap as shown in Fig. [I]-(b). The polar angle 6'*'' and the azimuth angle c^*'' were 
discretized with intervals A^tr and A^^tr that satisfied 

kcoAtAgtr = A;co Ai sin 6**' A^tr = A^. (37) 

Therefore, Eqn. (136!) can be approximated by 

Ni-lNj-l 

g{rl,t) ^^IJ2H A{kcoA„9r,<Pf), (38) 

t=kAt 

i=0 j=0 

where Ni = [0'^.^J Agtr \ , Nj = [27r/A^trJ, Of = lAgtr, and (pf = jA^tr. We denoted 
by r,fc jj the location in the global coordinate system corresponding to the location vector 
{kcoAt, 91^, (j)J) in the local coordinate system in Eqn. (138|) . On substitution from the finite- 
dimensional representation Eqn. ([6]) into Eqn. fl38l) with a and ij^ n{r) defined by Eqns. ([9]) 
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and (fTOj) . respectively, we obtained: 

N-l 



t=kAt 



n=0 j=0 j=0 



qK+k' 



(39) 
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FIGURE CAPTIONS 

1. (a) Schematic of the 3D OAT scanning geometry, (b) Schematic of the local coordinate 
system for the implementation of interpolation-based D-D imaging model. 

2. Slices corresponding to the plane y = oi (a) the phantom and the images reconstructed 
by use of (b) the CPU-based and (c) the CPU-based implementations of the FBP algorithm 
from the "128 x 90" -data. 

3. Slices corresponding to the plane y = oi the images reconstructed by use of the FBP 
algorithm with (a) the CPU-based implementation from the "64 x 90" -data, (b) the CPU- 
based implementation from the "64 x 45" -data, (c) the CPU-based implementation from the 
"32 X 45" -data, (d) the CPU-based implementation from the "64 x 90" -data, (e) the CPU- 
based implementation from the "64 x 45" -data, and (f) the CPU-based implementation from 
the "32 X 45"-data. 

4. Slices corresponding to the plane y = oi the images reconstructed by use of the 
CPU-based implementations of (a) the PLS-Int algorithm from the "64 x 90" -data, (b) the 
PLS-lnt algorithm from the "64 x 45"-data, (c) the PLS-lnt algorithm from the "32 x 45"- 
data, (d) the PLS-Sph algorithm from the "64 x 90" -data, (e) the PLS-Sph algorithm from 
the "64 X 45"-data, and (f) the PLS-Sph algorithm from the "32 x 45" -data. 

5. Profiles along the line {x,y) = (— 6.58,0)-mm of the images reconstructed by use of (a) 
the CPU- and CPU-based implementations of the FBP algorithm from the "128 x 90"-data, 
and (b) the CPU-based implementations of the PLS-lnt and the PLS-Sph algorithms from 
the "64 X 90"-data. 

6. Plots of the RMSE against the amount of data by use of the FBP, the PLS-lnt and the 
PLS-Sph algorithms. 

7. MIP renderings of the 3D images of the mouse body reconstructed by use of the CPU- 
based implementations of (a) the FBP algorithm from the "full data", (b) the PLS-lnt 
algorithm from the "full data" with /i = 1.0 x 10^, (c) the PLS-Sph algorithm from the "full 
data" with /i = 1.0 x 10^, (d) the FBP algorithm from the "quarter data", (e) the PLS-lnt 
algorithm from the "quarter data" with /^ = 1.0 x 10'^, and (f) the PLS-Sph algorithm from 
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the "quarter data" with /i = 1.0 x 10^. The grayscale window is [0,12.0]. 
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Algorithm 1 Implementation of the FBP algorithm (on host) 
Input: p 

Output: CKfbp 

1: w = —CpR^ /S.QS /S.^s / [tt ficQ^t) {Precalculate the common coefficient} 

2: T_p ^ p {Bound data to texture memory} 

3: KJbp ((( {Ny,N^), (iV,,l,l) ))) (w, D.afbp) 

4: afbp i— D_Q!fbp {Copy data from global memory to host} 

Algorithm 2 Implementation of kernel K_fbp ((( {Ny,Nx), (A''^, 1, 1) ))) 
Input: uj, T_p, D-Ctfbp 

Output: D-Ctfbp 

I: X = (blockldx.y)As + Xmm; y = (blockldx.x)As + ymin; z = (threadldx.x)As + z^i^ 

2: S = 

3: for n^ = to A^r — 1 do 

4: 9' = nrAes + 0^;^ ; z' = R' cos9'; r' = R' sm9'; w' = w sin 6' 

5: for n^, = to A'^y — 1 do 

6: (p^ = n„A<^s + 0^;^; x* = r* cos (j)^; y^ = r^ suKf)^ 

7: t= {{x - x'f + {y- y'f + {z - z'fyl^ 

8: tn = (i/co - tmin)/At; nt = floor(i„) 

9: S+=w'|[(ntAt+tmm)/(tnAt+imm)-1.5]T_pK][n.,][nt] + [l.5-(ntAt+tmm)/(tnAt+ 

imin)]T_p[n,.][n.u][ni + 1] ^ {Fetch data from texture memory} 
10: end for 

11: end for 
12: D_ctfbp[blockIdx.y][blockIdx.x][threadIdx.x] = S 
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Algorithm 3 Implementation of g = Gccint (on host) 



Input: Qint 
Output: g 

1: T-Ctint <— Q!int {Bound data to texture memory} 

2: K^rt((( {N,,Nt), (iV., 1, 1) ))) (D_g) 

3: g -^ D_g {Copy data from global memory to host} 



Algorithm 4 Implementation of kernel K_srt ((( {Ny,Nt), {Nr, 1,1) ))) 
Input: D_g, T_Q:int 

Output: D_g 

1: f = (blockldx.y)coAi + cotmin; ^^ = (threadIdx.x)A0s +0^1^; (f)'^ = (blockIdx.x)A0s +i 



2: e'^^^ = arccos {{{R'f + P - R^)/{2tW)) 



3: S - 0; 0' - 9'^^^ 

4: while 6' > do 

5: z' = t cos 0'; r' = t sin 6'; cj)' = 

6: while 0' < 27r do 

7: x' = r' cos (^'; y' = r' sin 0' 

8: X = —x' sin 0' — (z' — R^) cos 0'; y = y'', z = x' cos 0' — (z' — R^) sin 6*' {Convert to 

global coordinate system} 

"• Xfi — yX Xminj/^s, y-n — yy yminj/^si -^-n — v^ ■2-minj/^s 

10: E +=tex3D(2;„,y„,z„) {Tri-linear interpolation} 

11: (/)' = </>' + A^/r' 

12: end while 

13: 9' = 6' - As/i 

14: end w^hile 

15: D_g[threadldx.x][blockIdx.x][blockIdx.y] = SA^ 
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Algorithm 5 Implementation of cx.[^^ = G^g' (on host) 



Input: g' 

Output: a-^^ 
1: T_g' ^— g' {Bound data to texture memory} 
2: K^rtT ((( {N,„Nt), (TV., 1,1) ))) {I).(x[J 
3: a[^^ •<— D_aJ^^ {Copy data from global memory to host} 



Algorithm 6 Implementation of kernel K_srtT ((( (N^^Nt), {Nr, 1,1) ))) 
Input: D_a;„j, T_g;„t 

Output: D_ajjj^ 

1: i= (blockldx.y)coAi + cotmin; 0' = (threadldx.x) A^s + 9'^^^; <l)' = (blockldx.x) A^s + </.^j^ 

2: e^,, = arccos {[{Wf + P- R^)/{2tW)) ; 9' = ^U 

3: while 0' > do 

4: z' = t cos 9'; r' = t sin 9'\ (/)' = 

5: while 0' < 2tt do 

6: x' = r' cos </>'; y' = r' sin 0' 

7: X = — x'sin^ — {z' — R') cos 9; y = y'] z = x' cos9 — (z' — R') sinO {Convert to 

global coordinate system} 

9: Ucc = floor(x„); Uy = floor(y„); n^ = floor(z„) 

10: D_a(j^Jn2][ny][nJ+=A^(na; + 1 - Xn){ny + 1 - yn){nz + 1 - 

^;„)T_g(^^[threadIdx.x][blockIdx.x][blockIdx.y] {Add weights to one of the eight 
neighboring nodes by use of 'atomicAdd'; Repeat this operation for all other seven 
neighboring nodes} 

11: </,' = (/,' + As/r' 

12: end while 

13: 9' = 9' - As/i 

14: end while 
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Algorithm 7 Implementation of iisph = HsphC^sph (on host) 



Input: Qsph, Po 
Output: lisph 

1: for npth = to iVpth - 1 do 

2: parmjwdarg[npth].rapth = n^th 

3: parm_fwdarg[npth].po = &;Po[0] 

4: pa,T:mIwdaicg[npth]-apth=&^^asph[npthNxNyN:,/Npth] 

5: parm_fwdarg[npth]-itpth=^iisph['^pth^r-^D-^/] {Pass addresses of arrays to each pthread} 

6: pthread_create(&:pthreads[npth], NULL, fwd_pthread, (void *)(parm_fwdarg+npth)) {Call 

function fwd_pthread} 

7: end for 

8: for npth = to iVpth - 1 do 

9: forn = to NrN.^Nf do 

10: ii,ph [n] += fi^pi^ [n + Upt^NrN^Nf] 

11: end for 

12: end for 
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Algorithm 8 Implementation of function fwd_ptliread (on host) 
Input: ripth, Po, «pth, Upth 
Output: u'^^^ 

1: C_Pq ^ Po {Copy from host to constant memory} 

2: for n^ = to iV^/iVpth - 1 do 

3: X = {hx + n^thNx/N^th.)^s + a^min 

4: for n^ = to A'y — 1 do 

O: y ^ fly^s I ymin 

6: C.Q!pth •^ ctpth[^x][?^j/][:] {Copy from host to constant memory} 

7: K_fwdsph((( {N,,Nr),{Nf,l,l) ))) {x , y , B.u'^,^) 

8: end for 

9: end for 
10: uLjj ^ I-'-iipth {C-opy from global memory to host} 
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Algorithm 9 Implementation of Kernel KJwdspli ((( {N^, N^), (Nj, 1,1) ))) 
Input: X, y, BJi'^^^, C.Qpth, C_po 

Output: D_u' j^ 
1: 6' = (blockldx.y)Ae. + ^^i^; ^ = (blockldx.x)A^. + ,^V^; / = (threadldx.x)Aj + /^i„ 
2: z^ = R^ cos6^; x^ = R^ sin 9^ cos (f>^ ; y^ = R^ sin 9^ sin (f>^ {Calculate locations of transduc- 
ers} 
3: S*" = 0; S* = {Initiate the partial summation including the real and imaginary parts} 
4: for n^ = to A^2 — 1 do 

6: d= ({x - x*)2 + {y- y'f + {z - z')A 

7: h''' = cos{27rfd/co)/{27rd); h^ = —sin{27rfd/co)/{2'Kd) {Calculate SIR; Example here 

assumes point-like transducers} 

8: T,^' += C-Qpth [nz] ( h^C-Po [threadldx.x] .r — /i*C_po [threadldx.x] .i 

9: S* += C_Qpth[?^5;] ( ^'"C_po [threadldx.x]. i + /i*C_po [threadldx.x]. r 

10: end for 

11: D_u[,^i^[blockIdx.y][blockIdx.x] [threadldx.x]. r+= S'" 

12: B_u'^^^ [blockldx.y] [blockldx.x] [threadldx.x] .i += S* 
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Algorithm 10 Implementation of cx'^ = H^ , u (on host) 



Input: ii, po 
Output: a^pj^ 

1: for npth = to iVpth - 1 do 

2: parm_bwdarg[npth].rapth = n^th 

3: parm_bwdarg[npth]-Po = &;po[0] 

4: pa.rm_hwdaig[npth]-Upth=&^iL[npthNrNyNf/Npth] 

5: pa.miJ:)wdai[g[npth].a''^^=&z,a''^[nptiiNxNyNz] {Pass addresses of arrays to each pthread} 

6: pthread_create(&:pthreads[npth], NULL, bwd_pthread, (void *)(parm_bwdarg+npth)) {Call 

function bwd_pthread} 
7: end for 

8: for npth = to iVpth - 1 do 
9: forn = to N^NyN^ do 



10: a^ph [n] += a^'ph [^ + npti,N,NyN, 

11: end for 

12: end for 
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Algorithm 11 Implementation of function bwd_pthread (on host) 
Input: ripth, Po, Upth, Qpth 

Output: ctpti^ 

1: C_Pq ^ Po {Copy from host to constant memory} 

2: for rir = to Nr/Npth - 1 do 

3: P = K + npthiV,/iVpth)Ae. + 0,^j„; z' = R' cos 9'; r' = R'smt 

4: for n„ = to A'^ — 1 do 

5: (p'^ = UyA^s + (j)^^^; x^ = r'^ cos (j)^ ; y^ = r'^ sin (j)^ 

6: CLupth -^ Upth[?T-r][?^D][:] {Copy from host to constant memory} 

7: K_bwdsph((( {Ny,N^),{N,,l,l) ))) (x^ y^ z^ aaj^.J 

8: end for 

9: end for 
10: ct'L^ -^ T)-a''^^ {Copy from global memory to host} 
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Algorithm 12 Implementation of Kernel K_bwdspli ((( {Ny,Nx), (N^, 1, 1) ))) 
Input: x^ y^ 2;^ 'D-a'^^^, Cupth, C_po 

Output: B_a';,^ 

1: X = (blockldx.y)As + Xmm; y = (blockldx.x)As + ymm, z = (threadldx.x)As + Zmin 

2: d= {{x — x*)^ + (y — y*)^ + (-2 ~ z^)'^ J ; S = {Initiate the partial summation} 

3: for n/ = to iV/ - 1 do 

4: / = n/A/ + /min 

5: /i^ = cos(27r/(i/co)/(27r(i); /i* = - sin(27r/(i/co)/(27r(i) {Calculate SIR; Example here 

assumes point-like transducers} 
6: S+= C_Upth[n/].rf/l'■(lpo[n/].r-/^*CIpo[n/].^j+C_iipth[?^/]•^f/l*G.po[n/].r + /^''C_^^ 

7: end for 
8: D_Q|^^j^[blockIdx.y][blockIdx.x][threadIdx.x]+= E 
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TABLE I. Computational times of the 3D image reconstructions by use of the CPU- and GPU- 
based implementations 





FBP 


[sec] 


PLS-Int 


[min/iteration] 


PLS-Sph [min/iteration] 


Data sets 


CPU 


GPU 


CPU 


GPU 


CPU 


GPU 


"32 X 45" 


6,189 


6 


2,448 


20 


7,961 


22 


"64 X 45" 


12,975 


12 


4,896 


35 


15,923 


43 


"64 X 90" 


26, 190 


23 


9,792 


68 


31,845 


86 


"128 X 90" 


53,441 


46 


- 


- 


- 


- 


"quarter data" 


12,975 


12 


4,896 


35 


19, 776 


78 


"full data" 


53,441 


46 


19,968 


137 


79, 177 


313 
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